Abstract

Background: Nothofagus nervosa is one of the most emblematic native tree species of Patagonian temperate
forests. Here we describe the shotgun RNA-sequencing (RNA-Seq) of the N. nervosa transcriptome, including
de novo assembly, functional annotation, and in silico discovery of potential molecular markers to support
population and association genetic studies.

Results: Pyrosequencing of a young leaf cDNA library generated a total of 111,814 high quality reads, with an
average length of 447 bp. De novo assembly using Newbler resulted in 3,005 tentative isotigs (including
alternative transcripts). The non-assembled sequences (singletons) were clustered with CD-HIT-454 to identify
natural and artificial duplicates from pyrosequencing reads, leading to 21,881 unique singletons. Of the
24,886 non-redundant sequences, or unigenes, 15,497 were successfully annotated against a plant protein
database. A substantial number of simple sequence repeat markers (SSRs) were discovered in the assembled and
annotated sequences. More than 40% of the SSR sequences were inside ORF sequences. To confirm the validity of
these predicted markers, a subset of 73 SSRs selected on the basis of functional annotation evidence were
successfully amplified from DNA samples of six seedlings, and 14 of them were polymorphic.

Conclusions: This paper is the first report that shows a highly precise representation of the mRNA diversity
present in young leaves of a native South American tree, N. nervosa, as well as its in silico deduced putative
functionality. The reported Nothofagus transcriptome sequences represent a unique resource for genetic studies
and provide a tool to discover genes of interest and genetic markers that will greatly help to address
questions of evolution, ecology, and conservation using genetic and genomic approaches in the genus.

Keywords: Nothofagaceae, Forest genomics, Pyrosequencing, de novo transcriptome assembly, SSRs, Functional 
annotation 



Background 

The Nothofagaceae family contains only the genus Nothofagus and
comprises 36 recognized species, 26 of which occur in Australia
and the remaining 10 in South America [1]. In Argentina,
Nothofagus is represented by only six endemic species,
distributed on the foothills of the Andes and surrounding
valleys, from 36°S in the province of Neuquén to 55°S in the
province of Tierra del Fuego [2].

Among these species, N. obliqua, N. nervosa and N. pumilio
occupy a relatively precise range within an altitudinal gradient
spanning from 600 m above sea level up to 1800 m. Along this
gradient each species withstands different environmental
conditions, especially extremely cold temperatures at the higher
altitudes. Individual trees living along this environmental
gradient exhibit adaptive features for adverse conditions such as
drought and extreme temperatures, traits that may prove valuable
for adaptation in the context of global climate change.

N. nervosa (Phil.) Dim. et Mil. [3] (= N. alpina (Poepp. &
Endl.) Oerst.), commonly known as "raulí", is one of the most
important species of Patagonian temperate forests due to its wood
quality and its relatively fast growth [4]. In Argentina it
covers a reduced area, only 79,636 hectares in a narrow fringe of
about 120 km in length and about 40 km in maximum width [5,6].
This deciduous species was heavily overexploited in the past
because of its high wood quality, making it necessary to
implement conservation policies and management programs [7].

The distribution of adaptive genetic variation is an important
issue in forest species, both native and domesticated, serving as
a basis for natural resource management and conservation genetics
[8]. The characterization of genetic diversity is also important
in order to determine its relation to phenotypic variation [9].
Massive sequencing techniques are among the new strategies used
in functional genomics for gene discovery and molecular marker
development in non-model organisms or in species whose genomes
have not been completely sequenced. They provide a fast and
effective way to obtain new genetic information about an organism
and allow rapid access to a collection of expressed sequences
(transcriptome).

To date, model forest tree species belonging to the genera
Eucalyptus [10-12], Pinus, Picea and Populus [13-17] have
comprehensive transcriptome information.

The Fagaceae family (represented by the genera Quercus, Castanea
and Fagus) also has a large number of sequenced transcripts, with
approximately 2.5 million ESTs deposited in databases (Fagaceae
Genomics Web: http://www.fagaceae.org/). At present, new
sequencing technologies offer the possibility of obtaining gene
catalogs for non-model organisms, which is an opportunity for
forest tree transcriptome characterization and for the discovery
of alternative metabolic strategies and functional molecular
markers [9].

Table 1 N. nervosa transcriptome annotation summary

One of the advantages of transcriptome pyrosequencing is sequence
reliability: each region of the cDNA is read several times on
both strands, compared with the single read on one strand of
conventional ESTs.

In this study we characterized the leaf transcriptome of N.
nervosa by pyrosequencing and analyzed the resulting sequence
data. Moreover, the functional annotation of the unigenes allowed
us to obtain a global but thorough picture of functional gene
expression in the leaf, as well as to deduce the metabolic
pathways represented in this dataset.

This information will significantly contribute to the development
of Nothofagus functional genomics, genetics and population-based
genome studies. In addition, the rather limited set of molecular
markers available until now (14 microsatellites isolated from N.
cunninghamii [18], 11 developed in six species of South American
Nothofagus [19], five in N. nervosa [20], and nine microsatellite
loci from N. pumilio [21]) will be substantially increased with
thousands of new markers, from both neutral and functional
sequences. The quality of the sequence information reported here
was confirmed by the successful PCR amplification of molecular
markers using oligonucleotide primers designed from the deduced
sequences.

Results and discussion 

Transcriptome sequencing and assembly 

Pyrosequencing of cDNA on a 454 GS FLX Titanium (Roche) generated
a total of 146,267 raw reads, with an average length of 408 bp.
After filtering for adaptor, primer and low-quality sequences,
5,588 reads were removed, resulting in 140,679 high quality reads
corresponding to 96% of the raw sequences and representing
approximately 60 Mbp. Raw data (>200 bp) were deposited in the
NCBI Sequence Read Archive (SRA) under the accession number
SRA049632.2.

Number of sequences                                                   Isotigs (3,005)       Singletons (21,881)   Combined set (24,886)

Viridiplantae-NR
  Sequences with positive BLAST matches                               2,762 (92%)           12,735 (58%)          15,497 (62%)
  Sequences annotated with Gene Ontology (GO) terms                   2,238 (74%)           9,596 (44%)           11,834 (47%)
  Sequences without detectable BLAST matches                          243 (8%)              9,146 (42%)           9,389 (38%)
  Sequences assigned to known Enzyme Commission category              931 (31%)             1,424 (6%)            2,355 (9%)
Fagaceae
  Sequences with positive BLAST matches                               2,923 (97%)           17,515 (80%)          20,438 (82%)
  Sequences without detectable BLAST matches                          82 (3%)               4,365 (20%)           4,447 (18%)
  Sequences annotated with Gene Ontology (GO) terms ("novel genes")   12 (0.4%)             490 (2%)              502 (2%)

Numbers and percentages of 454 sequences in the assembled isotigs, singletons and unigenes with significant
matches against the NCBI NR protein database filtered for Viridiplantae and against Fagaceae unigenes.

Using Newbler software v. 2.5 (Roche, IN, USA), a total of
111,814 sequences were de novo assembled into 3,394 contiguous
sequences (contigs). Overlapping contigs were assembled into
3,005 isotigs (equivalent to unique RNA transcripts). In
addition, isotigs originating from the same contig graph were
grouped by Newbler into 2,722 isogroups (equivalent to genomic
loci), potentially reflecting multiple splice variants. About
28,861 reads not assembled into isotigs were clustered using the
CD-HIT-454 algorithm to eliminate artificial duplicates, leaving
21,881 singletons and giving a total of 24,886 non-redundant
sequences or unigenes (Table 1). All unigene sequences (isotigs
and singletons >200 bp) were deposited in the Transcriptome
Shotgun Assembly (TSA) database under accession numbers
JT763459-JT784547. Isotig length ranged from 66 bp to 7,093 bp,
with an overall average length of 765 ± 537 bp (Figure 1A). More
than 83% of the isotigs were 66 to 1,000 bp long, and 50% of the
assembled bases were incorporated into isotigs longer than
589 bp. The average length of N. nervosa isotigs (765 bp) was
larger than that of isotigs assembled in other non-model
organisms (e.g. 197 bp [22], 440 bp [23], 500 bp [24], 535 bp
[25]) and similar to the average isotig length described in
Bituminaria bituminosa (707 bp [26]).
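
The statement that 50% of the assembled bases lie in isotigs longer than a given length corresponds to what is commonly called the N50 statistic. The following is a minimal sketch of how such a value could be computed from a list of sequence lengths; the function name and the toy lengths are hypothetical and are not taken from the N. nervosa assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_of_bases:
            return length
    raise ValueError("lengths must not be empty")


# Toy example (hypothetical lengths in bp, not the real isotigs)
toy_lengths = [200, 300, 400, 500, 600]
print(n50(toy_lengths))  # prints 500
```

Sorting from longest to shortest and accumulating bases until half of the total is reached is the usual way to obtain this value; with real data the lengths would be read from the assembler's FASTA output.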
Figure 1 Frequency distribution of isotig (A) and singleton (B) sequence lengths. The histograms represent the
number of isotig and singleton sequences in relation to their length.
The coverage depth for isotigs ranged from 2 to 19,
with an average of 9 contigs assembled into each isotig,
which is larger than the averages obtained in other 454
transcriptome analyses (mean = 2.1, [24,25]).

The length of the 21,881 singletons ranged from 50 to 711 bp,
with an overall average length of 369.6 bp (Figure 1B). Of the
singletons, 86% were shorter than 500 bp.

Functional annotation 

All unique sequences were subjected to BLASTX 
similarity search against the NR protein database (Na- 
tional Center for Biotechnology Information, NCBI), 
with a Viridiplantae filter, to assign a putative function 
[27]. 

Under an E-value threshold of < 10^-10, a total of 2,762 isotigs
(92% of total isotigs) and 12,735 singleton sequences (58% of
total singletons) had significant BLASTX matches (Table 1). The
frequency of annotated isotigs was significantly higher than the
values previously reported for de novo transcriptome assemblies
of eukaryotes, which range from 20 to 40% [22-25].
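
The counting of sequences with significant matches can be sketched as follows. This is only an illustration, assuming that the BLASTX results are available in the standard 12-column tabular format (query ID in column 1, E-value in column 11); the function name and the toy records are hypothetical and do not reproduce the pipeline actually used in this study.

```python
def count_queries_with_hits(blast_lines, max_evalue=1e-10):
    """Count distinct queries with at least one hit below the E-value cut-off.

    Assumes tab-separated BLAST tabular output in which column 1 is the
    query ID and column 11 is the E-value.
    """
    queries_with_hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) < max_evalue:
            queries_with_hits.add(fields[0])
    return len(queries_with_hits)


# Toy records (hypothetical IDs and values)
toy_records = [
    "isotig00001\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t3e-25\t110",
    "isotig00001\tprotB\t60.0\t150\t60\t0\t1\t450\t1\t150\t1e-05\t50",
    "read0002\tprotC\t40.0\t80\t48\t0\t1\t240\t1\t80\t1e-03\t40",
    "read0003\tprotD\t70.0\t120\t36\t0\t1\t360\t1\t120\t1e-11\t70",
]
print(count_queries_with_hits(toy_records))  # prints 2
```

Dividing such a count by the number of isotigs or singletons gives percentages of the kind reported in Table 1.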

In total, 15,497 unique sequences had at least one hit, while the
remaining 9,389 sequences (38%) exhibited less significant
matches (E-value > 10^-10) that may still be informative for
identifying putative biological functions in future studies of
this species. We also performed a BLASTX search against the whole
NCBI NR protein database with the sequences that did not show
BLAST hits against the Viridiplantae subset; this yielded a few
new hits (81) but did not add any other valuable annotations.

The majority of matched sequences exhibited high 
similarity to Vitis vinifera (41%), and Populus tricho- 
carpa (38%) sequences. The top-hit species distribution 
of BLAST matches is shown in Figure 2. 

Annotation and mapping routines were run with the BLAST2GO
platform [28]. Sequences with a positive BLAST match were
annotated using Gene Ontology (GO) terms and Enzyme Commission
categories (i.e. EC numbers). GO terms were thus assigned to
2,238 isotigs (74%) and 9,596 singletons (44%), for a total of
11,834 GO-annotated sequences (Table 1).

Of the 11,834 GO-annotated isotig and singleton sequences, 7,926
were assigned to "Biological Process", 8,229 to "Molecular
Function" and 9,206 to "Cellular Component" (Figure 3).

BLAST2GO analysis at process level 2 showed that, among 21
different biological processes, most of the transcripts belonged
to "Metabolic Processes" (5,823), "Cellular Processes" (5,090)
and "Response to Stimuli" (1,493), of which 756 were putative
stress-response genes (Figure 3A).

Likewise, the molecular function category subdivided annotated
sequences into binding (6,985), catalytic activity (5,658) and
transporters (689) as the most represented (Figure 3B).

A detailed BLAST2GO analysis (level 2) of the cellular component
category sorted all transcripts from N. nervosa into five groups,
the most represented being cell (7,304), organelle (4,822) and
macromolecular complex component (1,136) (Figure 3C).

In order to compare more precisely the similarity of N. nervosa
genes with those of the Fagaceae family (from Fagaceae Genomics
Web [http://www.fagaceae.org/]), N. nervosa unigenes were
subjected to a BLAT (dnax) search against 2,407,823 contigs and
singletons from American beech (Fagus grandifolia), American
chestnut (Castanea dentata), Chinese chestnut (Castanea
mollissima) and oak species (Quercus rubra and Q. alba).
Eighty-two percent of the N. nervosa expressed sequences
exhibited high similarity to Fagaceae genes. A total of 4,447
(18%) sequences did not show matches against Fagaceae sequences,
of which 82 were isotigs and 4,365 were singletons. Among them,
12 isotigs and 490 singletons had distinctive GO annotation and
could be considered novel genes for this large group of tree
species (Table 1). Most interestingly, 21 of these transcripts
were found to be potentially new genes for stress response (data
not shown).

Of the 11,834 sequences annotated with GO terms, 
2,355 were assigned with EC numbers (931 isotigs and 
1,424 singletons) (Table 1). 

The most represented enzymes in all sequences are 
shown in Figure 4: transferase activity (37%), hydrolase 
activity (35%) and oxidoreductase activity (13%) were the 
most abundant. 

To further enhance the annotation of N. nervosa tran- 
scriptome dataset, the 11,834 genes with GO terms were 
mapped to KEGG using KEGG automatic annotation 
server (KAAS) [29]. The identified 58 metabolic path- 
ways include: purine metabolism (411), thiamine metab- 
olism (405), T cell receptor signalling pathway (115), 
biosynthesis of secondary metabolites (58), and mic- 
robial metabolism in diverse environments (37) (see 
Additional file 1). 

We detected as many as 861 chloroplast (cp) sequences (150 in
isotigs and 711 in singletons), corresponding to a fairly high
rate (7%); however, this value is within the 2 to 10% found in
cDNA libraries from all tissue types, as reported in a study
conducted in oak [30].

The number of annotated isotigs in this study was comparatively
larger than that obtained in other similar studies [22-25]. This
result could be associated with the high quality and small number
of assembled isotigs, which potentially correspond to highly
expressed genes. The use of specific plant protein sequences and
of a closely related Fagaceae database also possibly increased
the number of BLAST hits. The first explanation involves
technical factors, such as the high percentage of isotigs longer
than ~600 bp and with good coverage depth. Moreover, the small
number of isotigs would capture the most highly represented and
best-known expressed genes, as was also shown in the analysis of
the B. bituminosa leaf transcriptome (89.1% annotated contigs)
[26]. Proportions of best hits in the major GO categories were
generally similar to those found in that species: for example,
binding 48% and catalytic activity 37% in the N. nervosa
transcriptome survey versus 37% and 37%, respectively, for the
same categories in B. bituminosa.

Figure 2 Top-hit species distribution of BLASTX matches of N. nervosa unigenes. Proportion of N. nervosa
unigenes (isotigs + singletons) with similarity to sequences from the NCBI NR protein database (Viridiplantae
and whole database).

The second explanation relies on the annotation approach, which
was based on a search against the Viridiplantae protein database.
This strategy makes it more likely to find BLAST hits above the
cut-off value. In addition, a higher percentage of reliably
annotated isotigs was found when the search was carried out
against the Fagaceae sequence dataset (Table 1). The favorable
effect of using specific databases for annotation was also
reported by other authors [31-33].

In addition, the lower percentage of annotated singletons was
likely due to the high frequency of short sequences, as also
reported in recent studies [24,34]. Fifty percent of
non-annotated singletons were shorter than 370 bp (data not
shown), whereas 50% of annotated singletons were longer than
454 bp. Similar results were obtained in Pinus contorta, where
only 5% of contigs and singletons had BLAST matches when the
length of the sequences was less than 250 bp [24]. Nonetheless,
many singletons were good quality reads and matched proteins in
BLAST searches, representing, together with the isotigs, a great
source of information.

Figure 3 Gene Ontology (GO) assignment at level 2 of 11,834 N. nervosa unigenes. The total numbers of unigenes
annotated for each main category are 7,926 for "Biological Process" (A), 8,229 for "Molecular Function" (B),
and 9,206 for "Cellular Component" (C).

In summary, the frequency of annotated isotigs and singletons was
significantly higher than previously reported for
next-generation sequencing de novo transcriptome assemblies of
trees such as Pinus contorta [24] or two oak species, Quercus
petraea and Q. robur [30], despite the high stringency of the
BLASTX analysis.

If we assume that the average number of genes encoded in a plant
nuclear genome is about 30,000 (as estimated from seven
completely sequenced genomes) [34], our annotated dataset likely
represents about half of the N. nervosa gene catalogue.

In order to test for the presence of expressed repetitive
sequences, BLASTN (e-value cut-off < 10e-50) searches were
performed against all Viridiplantae entries in Repbase (a
reference database of eukaryotic repetitive DNA). A total of 374
repetitive DNA sequences were found (57 in isotigs and 317 in
singletons). Of these, 255 corresponded to small subunit rRNA
(SSU rRNA), 102 to large subunit rRNA (LSU rRNA) and 17 to
transposable elements. Similar numbers of retrotransposons were
observed in other plant species (e.g. 15 in Populus tremula and
Pinus pinaster) [24]. However, in Fagopyrum esculentum and Pinus
contorta many more transcribed retrotransposable elements were
found in the different tissues sampled [24,34].



In silico mining of simple sequence repeats (SSRs)

Using the SSR webserver from the Genome Database for Rosaceae
(GDR), we identified and characterized several SSR
(microsatellite) motifs as potential molecular markers in the
Nothofagus unigene collection.

The criterion used for SSR selection was the minimum number of
repeats, as follows: five for dinucleotide, four for
trinucleotide, three for tetranucleotide and three for penta- and
hexanucleotide motifs. These settings resulted in the
identification of 3,821 putative SSRs within 24,886 unigenes,
i.e. an SSR frequency of 15% when multiple occurrences in the
same unigene are counted. This was similar to the 19% reported in
oak by Durand [35] and somewhat lower than the 24% estimated by
Ueno [30]. A total of 3,048 (12%) unigenes contained at least one
SSR, and 2,517 SSRs (66%) had sufficient flanking sequence to
allow the design of appropriate unique primers. Information on
the unigene identification (ID), marker ID, repeat motif, repeat
length, primer sequences, positions of forward and reverse
primers, and expected fragment length is included in Additional
file 2.
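
The minimum-repeat criteria listed above can be illustrated with a short regular-expression search. This is a sketch under stated assumptions, not the algorithm implemented by the GDR SSR webserver: the function names are hypothetical, only perfect repeats are considered, and motifs that are themselves repetitions of a shorter unit (for example AA or ATAT) are discarded.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text
MIN_REPEATS = {2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq):
    """Return (start, motif, repeat_count) for perfect SSRs in a DNA sequence."""
    seq = seq.upper()
    found = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                found.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(found)


# Toy sequence (hypothetical): (TCT)5 and (AG)6 embedded in flanking bases
toy_seq = "GGCC" + "TCT" * 5 + "GATTACA" + "AG" * 6 + "C"
print(find_ssrs(toy_seq))  # prints [(4, 'TCT', 5), (26, 'AG', 6)]
```

A production tool would additionally handle compound and imperfect repeats and check that enough flanking sequence is available for primer design.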



Characterization of microsatellite motifs

As expected, the most frequent types of microsatellite were
trimeric (37.4%) and dimeric (32.3%) motifs, while tetra-, penta-
and hexanucleotide repeats were present at much lower frequencies
(16.3%, 5.2% and 8.8% respectively, Figure 5). Similar results
were found in oak [30] (36.6% for trimeric and 36.2% for dimeric
motifs) with minimum repeat numbers of five and four for di- and
trinucleotide microsatellites, respectively.

SSR motif combinations can be grouped into unique classes based
on DNA base complementarity. For example, dinucleotides were
grouped into the following four unique classes: AT/TA;
AG/GA/CT/TC; AC/CA/TG/GT and GC/CG. Thus, the numbers of unique
classes possible for di-, tri- and tetranucleotide repeats are 4,
10, and 33, respectively [36,37]. The AG/CT group was the
predominant class (56.2%) of the dinucleotide repeats, whereas
the AT (29.2%), AC (14.5%) and CG (0.1%) groups were less
represented. The frequency of AG was similar to the highest value
reported by Kumpatla [38] (14.6%-54.5% of the total SSRs observed
in 55 dicotyledonous species) but lower than that found in oak
(70.5%) [30] and eucalypts (91%) [39].
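
The class counts of 4, 10 and 33 can be checked by enumeration. The sketch below assumes that two motifs belong to the same class when one can be obtained from the other by cyclic shift and/or reverse complementation, and that motifs built from a shorter repeated unit (such as AA or ATAT) are excluded; the function names are illustrative only.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def canonical(motif):
    """Smallest string among all rotations of the motif and of its reverse complement."""
    reverse_complement = motif.translate(COMPLEMENT)[::-1]
    forms = []
    for s in (motif, reverse_complement):
        forms.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(forms)


def count_classes(k):
    """Number of unique SSR motif classes for repeat units of length k."""
    motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return len({canonical(m) for m in motifs if is_primitive(m)})


print([count_classes(k) for k in (2, 3, 4)])  # prints [4, 10, 33]
```

For dinucleotides the four canonical representatives are AC, AG, AT and CG, which correspond to the four classes listed in the text.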

The most frequent trimeric SSR motifs were AAG/CTT (27.8%),
ATG/CAT (15.2%), AGC/GCT (12.6%) and AGG/CCT (11.6%); the
frequency of the first category was similar to that found in oak
(26.8%) [30]. Within tetrameric motifs, the AAAT repeat was found
to be the most abundant (32.9%), followed by AAAG (22.7%) and
AACA (11.6%).

The distribution of SSRs was analyzed with respect to their
presence within UTRs and coding sequence regions. About 45% of
the SSR sequences were inside ORF sequences. Most trinucleotide
repeats were found in ORFs (52%), while dinucleotides were more
frequent in the UTRs (40%), similar to what was reported in oak
[30] and pines [40]. Tri- and hexanucleotide repeats are expected
to occur more frequently than other motifs in coding sequences.
Such dominance of triplets over other repeats in coding regions
may be explained by the selective disadvantage of non-trimeric
SSR variants in coding regions, which may cause frame-shift
mutations [41].
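
The frame-shift argument can be made concrete with a small calculation: gaining or losing repeat units changes the length of the coding sequence by a multiple of the motif length, and the downstream reading frame is preserved only when that change is a multiple of three. The sketch below uses hypothetical function names and toy sequences.

```python
def suffix_frame(prefix, motif, repeats):
    """Reading-frame position (0, 1 or 2) at which the sequence after the SSR starts."""
    return (len(prefix) + len(motif) * repeats) % 3


def keeps_frame(prefix, motif, repeats, change):
    """True if gaining (or losing) `change` repeat units leaves the downstream frame unchanged."""
    return suffix_frame(prefix, motif, repeats) == suffix_frame(
        prefix, motif, repeats + change
    )


# Toy examples: a start codon followed by an SSR inside an ORF
print(keeps_frame("ATG", "GAA", 4, 1))  # prints True  (trinucleotide: +3 bp)
print(keeps_frame("ATG", "AG", 5, 1))   # prints False (dinucleotide: +2 bp)
print(keeps_frame("ATG", "AG", 5, 3))   # prints True  (three units: +6 bp)
```

Length variants of tri- and hexanucleotide SSRs therefore only add or remove whole codons, whereas most length variants of other motif sizes shift the frame.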

Validation of the predicted microsatellite markers 

Seventy-three microsatellites were selected according to their
sequence length, GC content and functional annotation in
categories related to abiotic stress.

Of these, 57% were located in coding regions. The 73 loci were
tested for successful PCR amplification in six individuals. All
of them were successfully amplified, validating the quality of
the assembly and the utility of the SSRs produced. A similar
study carried out using Illumina sequencing technology in sesame
showed that about 90% of primer pairs successfully amplified DNA
fragments [42]. On the other hand, the rate of SSR validation was
lower (64.9%) when the marker mining was done using ESTs produced
by Sanger technology [39], possibly because of low-quality EST
sequences and/or primer sequences derived from chimeric cDNA
clones.

About 20% (14 SSRs) of the tested Nothofagus SSRs were
polymorphic, showing at least one individual that differed in
allelic composition.

This relatively low percentage of polymorphic loci could be
explained by the small sample size tested (six seedlings), in
contrast to the 46% found in E. globulus [39], evaluated in 8
samples, and the 80% found in sesame [42], assayed in 24 samples.

Nine of the polymorphic SSRs found in this work were located
within predicted ORFs, and seven of them had repeat motifs whose
length is a multiple of three (Table 2), consistent with their
presence in coding regions [41].

Conclusions 

The transcriptome database obtained and characterized here
represents a major contribution to N. nervosa genomics and
genetics. It will be useful for discovering genes of interest and
genetic markers to investigate functional diversity in natural
populations, as well as for conducting comparative genomics
studies in southern beeches that take advantage of their
remarkable ecophysiological differences. This work highlights the
utility of high-throughput transcriptome sequencing as a fast and
cost-effective way of obtaining information on coding genetic
variation in the genus Nothofagus. This study allowed us to: (i)
obtain 146,267 transcript raw reads and 24,886 unigene sequences
from N. nervosa, (ii) identify putative functions for 15,497
unigenes, which potentially represent 50% of the N. nervosa
transcriptome, (iii) identify 756 putative stress-response genes
(21 not described in Fagaceae), (iv) discover 2,517 SSRs with
designed primers and (v) detect 14 polymorphic SSRs related to
stress response.

Methods 

RNA preparation and cDNA library synthesis 

Total RNA was prepared by the method of Chang and collaborators
[43] from leaves of a single seedling. One gram of fresh tissue
was ground to a fine powder under liquid nitrogen. Then, after 2
extractions with chloroform, RNA was precipitated with LiCl,
extracted again with chloroform and finally precipitated with
ethanol. The resultant RNA was resuspended in 50 µl of
DEPC-treated water. RNA was quantified using a NanoDrop 1000
spectrophotometer and its quality was measured with a 2100
Bioanalyzer (Agilent Technologies Inc.). The isolated total RNA
was purified using the Poly(A) Purist kit (Ambion) and the
quality assessed with a 2100 Bioanalyzer (Agilent Technologies).
cDNA was synthesized using the cDNA Kit (Roche) and used to
construct a shotgun library for pyrosequencing (Roche). The
Nothofagus cDNA library was subjected to a production run on one
third of a plate on the 454 GS FLX sequencing instrument. 454
library preparation and sequencing were conducted at INDEAR
(Rosario Biotechnology Institute, Rosario, Argentina).

Transcript assembly and analysis 

After removing low quality sequences and filtering for adaptors
and primers, curated raw 454 read sequences were assembled into
contigs, isotigs and isogroups using Newbler Assembler software
2.5p1 (Roche, IN, USA). Reads identified as singletons (i.e.,
reads not assembled into isotigs) after assembly were subjected
to the CD-HIT-454 clustering algorithm using a sequence identity
cut-off of 90%, which eliminates redundant sequences or
artificial duplicates.
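
The idea of removing redundant reads at a 90% identity cut-off can be illustrated with a greedy incremental clustering in the spirit of CD-HIT: reads are processed from longest to shortest, and a read becomes a new representative only if it is not sufficiently similar to any existing representative. This is a simplified sketch, not the CD-HIT-454 algorithm itself; in particular, the similarity ratio from Python's difflib stands in for a true alignment identity, and the function name and toy reads are hypothetical.

```python
from difflib import SequenceMatcher


def greedy_cluster(reads, min_identity=0.90):
    """Return one representative per cluster of near-identical reads.

    Reads are visited from longest to shortest; a read is kept as a new
    representative only if its similarity to every existing representative
    is below min_identity. The difflib ratio is an assumed stand-in for
    alignment-based sequence identity.
    """
    representatives = []
    for read in sorted(reads, key=len, reverse=True):
        is_redundant = any(
            SequenceMatcher(None, rep, read, autojunk=False).ratio() >= min_identity
            for rep in representatives
        )
        if not is_redundant:
            representatives.append(read)
    return representatives


# Toy reads (hypothetical): two identical reads and one unrelated read
toy_reads = ["ACGT" * 5, "ACGT" * 5, "TTTTTGGGGGCCCCCAAAAA"]
print(len(greedy_cluster(toy_reads)))  # prints 2
```

The pairwise comparison used here scales poorly and is only suitable for small examples; dedicated tools rely on short-word filtering to avoid most alignments.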

BLASTX (e-value cut-off < 10e-10) searches were performed first
against the Viridiplantae protein database; the sequences with no
hits were then used in a subsequent BLASTX search against the
NCBI NR protein database in order to assess the putative
identities of the sequences. We also performed a pairwise
alignment using BLAT (dnax) against the Fagaceae family sequences
to search for expressed sequences exclusive to N. nervosa.
Annotation and mapping routines were run with BLAST2GO, which
assigns Gene Ontology (GO; http://www.geneontology.org)
annotation, KEGG maps (Kyoto Encyclopedia of Genes and Genomes,
KAAS) and an enzyme classification number (EC number) using a
combination of similarity searches and statistical analysis [29].

To search for chloroplast sequences we performed BLASTN and
TBLASTX searches (BLASTN e-50, TBLASTX 10e-10) by similarity
(with and without translation) to 109 chloroplasts (nt and aa)
from the chloroplast genome database
(http://chloroplast.cbio.psu.edu/organism.cgi).

SSR discovery 

In order to identify SSRs for all possible combinations of
dinucleotide, trinucleotide, tetranucleotide and pentanucleotide
repeats, the SSR webserver (GDR) was run
(http://www.rosaceae.org/bio/content?title=&url=/cgi-bin/gdr/gdr_ssr).
The same tool uses the GETORF algorithm (EMBOSS package) to
select the longest ORF as the putative coding region, and Primer
3 (v.0.4.0) [44] to design primer pairs.

The presence of expressed repetitive DNA was assessed using
BLASTN (e-value cut-off < 10e-10) searches against all
Viridiplantae entries in Repbase and with CENSOR [45], a software
tool that screens query sequences against a reference collection
of repeats and "censors" (masks) homologous portions with masking
symbols, as well as generating a report classifying all repeats
found.



Table 2 Polymorphic SSRs primer pairs derived from N. nervosa unigenes

ID name         Locus      Repeat     ORF  Forward (F) and reverse (R) Amplicon  BLASTX sequence description                Seq     Sim    GO terms related to
                           motif           primers (5'-3')             length                                               length  mean   response to stress
                                                                       observed                                             (bp)    (%)

isotig00192     INTANOT1   (tct)5     Y    k LCAGAIGGG I I I IGCI IG   148       heat shock protein 81-1                    2309    97.2   response to stimulus
                                           R: GACGATGAAGACGATGAGG
isotig00230     INTANOT2   (tcg)5     N    F: TTTCCAAACGGTTCGAGAAG     120       af367280_1 at3g56860 t8m16_190             1229    76.6   response to stress
                                           R: AACGGAGAAGGATGTTTCCA
isotig00551     INTANOT3   (tcattt)3  Y    F: GCGATGTGATCGATAGGCTT     204       ac005850_9 highly similar to mlo proteins  1759    77.5   defense response to fungus
                                           R: CATGTCCCCAGTTCACCTGT
isotig00597     INTANOT4   (ta)6      N    F: AAAACACCACGAAACCCAAA     197       dnaj heat shock n-terminal                 1516    78.3   response to stimulus
                                           R: CTTTGCGAGGGCAACTAAAT               domain-containing protein
isotig01207     INTANOT5   (tct)7     N    F: CTCGAAGACGCTAGCAGACC     280       af214107_1 -like protein                   748     79.3   response to stimulus
                                           R: TCCTGGGTTTTGCATATTGG
isotig01232     INTANOT6   (atc)4     Y    F: CGTTTCCCnTAGCTGATGC      173       aldh6b2 3-chloroallyl aldehyde             741     96.8   response to stress
                                           R: GCTGAGTTAGCAATGGAGGG               dehydrogenase methylmalonate-semialdehyde
                                                                                 dehydrogenase oxidoreductase
GR7D2IN01BK031  INTANOT7   (ag)5      N    F: GACGACATCGTTCCGAGTTT     241       f-box family protein                       536     75.4   response to heat
                                           R: GTTAATCCCTCTCTCCTCAT
GR7D2IN01CGQUT  INTANOT8   (ccgaaa)3  Y    F: CTCCCTCAAACACGTGCAAA     236       mitogen-activated protein kinase kinase    518     90.5   response to osmotic stress
                                           R: ATTCAAGTGGGTCTTGCCTG
GR7D2IN01EMGEO  INTANOT9   (ct)8      N    F: GCGGCTACCTGTTTGTTTTA     155       at1g78870 f9k20_8                          507     100.0  response to metal ion
                                           R: TTCCTTGATGATTGTTCGGG
GR7D2IN02FPPC7  INTANOT10  (ggt)6     Y    F: AAAATTGCTGTTGAGGGTGG     117       af361609_1 at1g27760 t22c5_5               529     87.9   response to osmotic stress
                                           R: CCTGAATCACCAGACCGAC
GR7D2IN02GFAUT  INTANOT11  (gaa)4     Y    F: ATCCCCAATCTTTCCCAATC     115       salt overly sensitive 1                    315     78.5   response to reactive oxygen species;
                                           R: AA^CTGTCCGCTTTGGCTA                                                                          response to osmotic stress
GR7D2IN02GR6NZ  INTANOT12  (at)5      Y    F: TCTTGTGGCAAGTGCTTGAG     285       win2_soltu ame: full = wound-induced       472     94.0   defense response
                                           R: AGTATCCTGACGGTTGCCTG               protein win2 flags: precursor
GR7D2IN02HOKOI  INTANOT13  (tc)5      Y    F: ATATCCTGGAAATGCTTGCG     124       exec1_arath ame: full = protein executer   469     71.7   response to reactive oxygen species
                                           R: TAAACGATCTTCGGAATGGG               chloroplastic flags: precursor
GR7D2IN02HWXOR  INTANOT14  (tgg)8     Y    F: AGGAGCTAAATGGGCGTAA      260       glycine-rich rna-binding protein           452     86.5   response to stress
                                           R: CACCACGAGCAGCAAAGAA

Included are ID names, locus names, motif and number of repeats, position in ORF (Y/N), sequence of forward and
reverse primers (5'-3'), observed amplicon length (bp), BLASTX similarity matches (putative function), sequence
length (bp), similarity mean (%), and GO terms related to stress response.
SSR validation

For validation of SSR primers, total DNA was extracted from young
leaves of six N. nervosa seedlings using the DNeasy Plant Mini
Kit (Qiagen), following the manufacturer's instructions.

Standard small-scale primers were synthesized (AlphaDNA,
Montreal, Canada) and used for PCR amplification. PCR reactions
consisted of 20 ng total DNA, 0.25 µM of each primer, 3 mM MgCl2,
0.2 mM of each dNTP, 1X PCR buffer and 1 U Platinum Taq
polymerase (Invitrogen). All PCR amplifications were performed
under the following conditions: an initial denaturation step of
2 min at 94°C; a regular touchdown PCR ranging from 60°C to 50°C
(except for INTANOT14, annealed at 55°C), with 28 cycles at the
final touchdown temperature of 50°C, each consisting of 45 s at
92°C, 45 s at 50°C and 45 s at 72°C; and a final extension step
of 10 min at 72°C. Samples were mixed with denaturing loading
buffer, incubated for 5 min at 95°C, and separated on a 6%
polyacrylamide gel. Amplification products were stained using the
DNA silver staining procedure of Promega, USA, following the
manufacturer's instructions. Details of primer sequences, SSR
location and amplicon sizes are given in Table 2.
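
The touchdown profile can be written out as a list of per-cycle annealing temperatures. The sketch below is only an illustration of the scheme: the decrement of 1°C per cycle during the touchdown phase is an assumption (the step size is not stated in the text), and the function name is hypothetical.

```python
def touchdown_annealing(start=60, end=50, step=1, final_cycles=28):
    """Per-cycle annealing temperatures (°C) for a touchdown PCR.

    Assumption: the annealing temperature drops by `step` °C per cycle from
    `start` until just above `end`, followed by `final_cycles` cycles at `end`.
    """
    touchdown_phase = list(range(start, end, -step))  # 60, 59, ..., 51
    return touchdown_phase + [end] * final_cycles


profile = touchdown_annealing()
print(profile[:11])   # prints [60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50]
print(len(profile))   # prints 38
```

Under this assumption the programme comprises 10 touchdown cycles followed by the 28 cycles at 50°C, each cycle framed by the 45 s denaturation at 92°C and the 45 s extension at 72°C described above.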